Introduction

The purpose of this project is to study and predict the relationship between life expectancy and other related statistics (factors) of countries. Life expectancy is an estimate of how long a person would live on average for each country. As life expectancy is a very point indicate of people life quality, this study will help people understand how to improve people life quality. Also, because the observations in this data set come from different countries, it will be easier for a country to identify the predicting factor that contributes to a lower life expectancy value. This will assist in recommending to a country which areas should be prioritized in order to effectively raise the population’s life expectancy. The data of this project comes from the Global Health Observatory (GHO) data repository under World Health Organization (WHO). Globally, life expectancy has increased by more than 6 years between 2000 and 2019 – from 66.8 years in 2000 to 73.4 years in 2019. We hope to find what’s the most important issue behind that increase for life expectancy.

This project contains 2 parts:

The first part is to explore the relationships between these variables for model insight.

The second part is to build models to find the important predictors of life expectancy.

Data

Load and Clean Data

Here we load in the data, and present the first several rows of the data

Life_Exp <- read_csv(file = "Life Expectancy Data.csv") %>% na.omit() 
Life_Exp 
## # A tibble: 1,649 × 22
##    Country      Year Status    `Life expectan…` `Adult Mortali…` `infant deaths`
##    <chr>       <dbl> <chr>                <dbl>            <dbl>           <dbl>
##  1 Afghanistan  2015 Developi…             65                263              62
##  2 Afghanistan  2014 Developi…             59.9              271              64
##  3 Afghanistan  2013 Developi…             59.9              268              66
##  4 Afghanistan  2012 Developi…             59.5              272              69
##  5 Afghanistan  2011 Developi…             59.2              275              71
##  6 Afghanistan  2010 Developi…             58.8              279              74
##  7 Afghanistan  2009 Developi…             58.6              281              77
##  8 Afghanistan  2008 Developi…             58.1              287              80
##  9 Afghanistan  2007 Developi…             57.5              295              82
## 10 Afghanistan  2006 Developi…             57.3              295              84
## # … with 1,639 more rows, and 16 more variables: Alcohol <dbl>,
## #   `percentage expenditure` <dbl>, `Hepatitis B` <dbl>, Measles <dbl>,
## #   BMI <dbl>, `under-five deaths` <dbl>, Polio <dbl>,
## #   `Total expenditure` <dbl>, Diphtheria <dbl>, `HIV/AIDS` <dbl>, GDP <dbl>,
## #   Population <dbl>, `thinness  1-19 years` <dbl>, `thinness 5-9 years` <dbl>,
## #   `Income composition of resources` <dbl>, Schooling <dbl>

The original dataset contains 2938 observations and 22 variables. Since there are some missing values in the dataset, so we exclude related statistics in order to make the later process and models perform better.

After cleaning data, the dataset now contains 1649 observations and 22 variables. These 1649 observations are from 133 countries, and recorded their life expectancy from 2000 to 2015. Notice: Some countries miss some year.

Preview Data

The response variable of this dataset is Life expectany. This is a continuous variable ranged from 0 to 100.

The predictor variables have two main types:

  1. Numeric variables:
    • Years: ranged from 2000 to 2015
    • Statistics about longetivity and mortality: Adult Mortality, infant deaths, under-five deaths, Population
    • Statistics about economy: Total expenditure, GDP, Income composition of resources
    • Statistics about illness: Hepatitis B, Measles, Polio, Diphtheria, HIV/AIDS
    • Statistics about people: Schooling, thinness 1-19 years, thinness 5-9 years, BMI , Alcohol.
  1. Category variable:
    • Country
    • Status: Developing, Developed

For the detailed description of these data, the Adult.Mortality column represents the adult mortality rates of both genders, which is the probability of dying between 15 and 60 years per 1000 population. Infant.deaths shows the number of infant deaths per 1000 population. The Alcohol column describes the total litres of consumption for pure alcohol recorded per capita for ages 15 and older. The Hepatitis.B, Polio, and Diphteria variables shows the percentage of immunization coverage among 1-year-olds for these diseases. The Measles column represents the number of reported cases of the measles per 1000 population. thinness 1-19 years represents the percentage of thinness present in children ranging from the age of 10 to 19 years old. thinness 5-9 years represents the percentage of thinness present in children ranging from the age of 10 to 19 years old.

Simple Analysis of Response Variable

Life_Exp %>%
  summarise(average=mean(`Life expectancy`), 
            maximum=max(`Life expectancy`),
            minimum=min(`Life expectancy`))
## # A tibble: 1 × 3
##   average maximum minimum
##     <dbl>   <dbl>   <dbl>
## 1    69.3      89      44

The life expectancy ranges from 44 to 89, so the people in the country with the highest life expectancy live almost twice as long as the ones in the country with the lowest life expectancy. Also, the average life expectancy is 69.3023, which is close to 70. This value is much higher than the past decades. Thus, it’s significant to explore these situations and predict the future possibility.

Explorarotory data analysis

Life Expectancy

First, we can use histogram to roughly analyze the response variable “Life Expectancy” in the dataset, and later we will predict the life expectancy in our model.

ggplot(Life_Exp, aes(`Life expectancy`)) +
  geom_histogram(bins = 50, color = "white") +
  labs(title = "Histogram of Life Expectancy") +
  theme_bw()

From the histogram above, we can see that the distribution of life expectancy is a little light skewed. The range of life expectancy is from 44 to 89, and most life expectancy are concentrated from 60 to 80. Especially from 70 to 75, there is the highest count. Besides, the life expectancy, which are more than 85 and less than 50, have less count. Next, we will continue to explore the relationship between the life expectancy and other factors, which may affect the life expectancy.

Development status

For the next question, Do developed countries significantly have higher life expectancy than the undeveloped countries? Regarding life expectancy and developed/developing countries, we can use EDA method and statistic method to answer this question and explore the relationship between the life expectancy and development status.

First, we can use the histogram to directly display the the number of developed/developing countries.

status_counting <- Life_Exp %>%
    group_by(Status) %>%
    summarise(Count = n())

ggplot(data = status_counting, aes(x = Status, y = Count)) +
  geom_histogram(bins = 50, stat = "identity", fill = c("red", "blue")) +
  labs(title="Histogram of Development Status") +
  theme_bw()

From the histogram we can see that the statistics of developing countries (more than 1250)is much more than developed countries (less than 250). The big difference is formed because there are more developing countries in the world.

Then we can use a group boxplot to further illustrate that the developed countries has higher life expectancy. Here is the Life Expectancy against Development Status group box plot.

library(ggstatsplot)
plt <- ggbetweenstats(
  data = Life_Exp,
  x = Status, 
  y = "Life expectancy",
  plot.type = "box",
  type = "p",
  conf.level = 0.99,
  title = "Parametric test",
  bf.message = FALSE,
  results.subtitle = FALSE
)

plt <- plt + 
  # Add labels and title
  labs(
    x = "Development Status",
    y = "Life Expectancy",
    title = "Life Expectancy against Development Status"
  )
plt

From the boxplot, we can see there is a stark difference in the life expectancy in developing and developed countries. The mean of developed countries is 78.69, and the mean of developing countries is 67.69. The median of developed countries is around 78, and the median of developing countries is around 70. Thus, Both the mean and median for developed countries life expectancy is much higher than the developing ones. Besides, the range of life expectancy for developed countries is around between 70 and 90. The range of life expectancy for developing countries is around between 45 and 90. Thus, The minimum life expectancy is also much lower in developing countries.

We can also apply parametric t test to illustrate that difference is significant.

print("result of t test for difference of 2 groups")
## [1] "result of t test for difference of 2 groups"
t.test(Life_Exp$`Life expectancy`[Life_Exp$Status == "Developing"],
       Life_Exp$`Life expectancy`[Life_Exp$Status == "Developed"],
       alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  Life_Exp$`Life expectancy`[Life_Exp$Status == "Developing"] and Life_Exp$`Life expectancy`[Life_Exp$Status == "Developed"]
## t = -31.117, df = 616.28, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -10.42181
## sample estimates:
## mean of x mean of y 
##  67.68735  78.69174

As the p value for the t test is smaller than 0.05, so we can reject the null under 0.05 significance level and conclude that the true difference between developing countries and developed countries is less than 0. Thus, we can see that the status of the country could affect life expectancy greatly. The people in the developed countries have larger life expectancy than the people in the developing countries.

Country

Then, we will explore the relationship between the life expectancy and countries. Different from only analyzing the development status, this part we will analyze the life expectancy among each different country in the dataset.

We can calculate the mean life expectancy for all countries, and then plot that value on the global map.

# obtain the mean by country
new_data <- Life_Exp
summary_data <- new_data %>% 
    group_by(Country) %>% 
    summarize(mean=mean(`Life expectancy`))

library(plotly)
plot <- plot_geo(summary_data, locationmode="country names")%>%
    add_trace(locations=~Country,
             z=~mean,
             color=~mean) %>%
    plotly::layout(autosize = T,
                   title = 'Mean Life Expectancy from 2000-2015 in map',
                   geo = list(showframe = TRUE,
                              showcoastlines = TRUE,
                              projection = list(type = 'Mercator')))
plot

From this global map, we can see that African countries especially mid-south African countries are almost the lowest life expectancy regions, and the Canada, Europe and Australia are the highest regions.

Also, We will use bar plot to find the countries which has the highest and lowest life expectancy.

options(repr.plot.width = 20, repr.plot.height = 20) 
Life_Exp %>%
  group_by(Country) %>%
  summarise(mean = mean(`Life expectancy`)) %>%
  ggplot(aes(x = mean, y = reorder(Country, mean))) +
  geom_bar(stat = 'identity') +
  labs(title = 'Mean Life Expectancy from 2000-2015', 
       x = 'Life Expectancy',
       y = 'Countries') 

summary_data %>%
  arrange(desc(mean))
## # A tibble: 133 × 2
##    Country      mean
##    <chr>       <dbl>
##  1 Ireland      83.4
##  2 Canada       82.2
##  3 France       82.2
##  4 Italy        82.2
##  5 Spain        82.0
##  6 Australia    81.9
##  7 Sweden       81.9
##  8 Austria      81.5
##  9 Netherlands  81.3
## 10 Greece       81.2
## # … with 123 more rows

From the bar plot above, we can see that Ireland has the highest Life Expectancy and Sierra Leone has the lowest Life Expectancy. This result of bar graph is consistent withe the calculation result. Also, Ireland is a European country, and Sierra Leone is a African country. Thus, this result is also consistent with the analysis of global map above.

Numerical variables

This part we will explore the relationship between the life expectancy with left numerical variables.

Overview num variables independently

First we can take a look at the overall situation of the variable independently.

summary(Life_Exp)
##    Country               Year         Status          Life expectancy
##  Length:1649        Min.   :2000   Length:1649        Min.   :44.0   
##  Class :character   1st Qu.:2005   Class :character   1st Qu.:64.4   
##  Mode  :character   Median :2008   Mode  :character   Median :71.7   
##                     Mean   :2008                      Mean   :69.3   
##                     3rd Qu.:2011                      3rd Qu.:75.0   
##                     Max.   :2015                      Max.   :89.0   
##  Adult Mortality infant deaths        Alcohol       percentage expenditure
##  Min.   :  1.0   Min.   :   0.00   Min.   : 0.010   Min.   :    0.00      
##  1st Qu.: 77.0   1st Qu.:   1.00   1st Qu.: 0.810   1st Qu.:   37.44      
##  Median :148.0   Median :   3.00   Median : 3.790   Median :  145.10      
##  Mean   :168.2   Mean   :  32.55   Mean   : 4.533   Mean   :  698.97      
##  3rd Qu.:227.0   3rd Qu.:  22.00   3rd Qu.: 7.340   3rd Qu.:  509.39      
##  Max.   :723.0   Max.   :1600.00   Max.   :17.870   Max.   :18961.35      
##   Hepatitis B       Measles            BMI        under-five deaths
##  Min.   : 2.00   Min.   :     0   Min.   : 2.00   Min.   :   0.00  
##  1st Qu.:74.00   1st Qu.:     0   1st Qu.:19.50   1st Qu.:   1.00  
##  Median :89.00   Median :    15   Median :43.70   Median :   4.00  
##  Mean   :79.22   Mean   :  2224   Mean   :38.13   Mean   :  44.22  
##  3rd Qu.:96.00   3rd Qu.:   373   3rd Qu.:55.80   3rd Qu.:  29.00  
##  Max.   :99.00   Max.   :131441   Max.   :77.10   Max.   :2100.00  
##      Polio       Total expenditure   Diphtheria       HIV/AIDS     
##  Min.   : 3.00   Min.   : 0.740    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:81.00   1st Qu.: 4.410    1st Qu.:82.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.840    Median :92.00   Median : 0.100  
##  Mean   :83.56   Mean   : 5.956    Mean   :84.16   Mean   : 1.984  
##  3rd Qu.:97.00   3rd Qu.: 7.470    3rd Qu.:97.00   3rd Qu.: 0.700  
##  Max.   :99.00   Max.   :14.390    Max.   :99.00   Max.   :50.600  
##       GDP              Population        thinness  1-19 years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.100      
##  1st Qu.:   462.15   1st Qu.:1.919e+05   1st Qu.: 1.600      
##  Median :  1592.57   Median :1.420e+06   Median : 3.000      
##  Mean   :  5566.03   Mean   :1.465e+07   Mean   : 4.851      
##  3rd Qu.:  4718.51   3rd Qu.:7.659e+06   3rd Qu.: 7.100      
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.200      
##  thinness 5-9 years Income composition of resources   Schooling    
##  Min.   : 0.100     Min.   :0.0000                  Min.   : 4.20  
##  1st Qu.: 1.700     1st Qu.:0.5090                  1st Qu.:10.30  
##  Median : 3.200     Median :0.6730                  Median :12.30  
##  Mean   : 4.908     Mean   :0.6316                  Mean   :12.12  
##  3rd Qu.: 7.100     3rd Qu.:0.7510                  3rd Qu.:14.00  
##  Max.   :28.200     Max.   :0.9360                  Max.   :20.70

Selected num variables analysis

Now let’s make a scatter plot matrix of the response variable life expectancy with some selected continuous variable: Year, GDP, Population, Schooling, thinness 1-19 years.

library(GGally)
ggpairs(Life_Exp %>%
          select(`Life expectancy`, Year, GDP, 
                 Population, Schooling, `thinness  1-19 years`))

From the scatter plot matrix, we can observe there is a slightly positive relationship between year and life expectancy.The life expectancy increase when year increase. Lowest life expectancy increase much more when GDP increase, so the GDP has a high effect on the life expectation. The life expectancy is not so affected by population. Very suprisingly, the schooling has the most positive correlation with life expectancy as 0.752, and from the scatter plot matrix there is a very clear linear trend between them. For the thinness 1-19 years, it is not a suprise to observe it has a slightly negative correlation with life expectancy.

Overview all num variables with life expectancy

Then, we will independently create the scatterplot for each predictor to clearly and directly determine the relationship between life expectancy and numeric variables. Also, the overview will help the final summary from EDA better.

Life_Exp %>%
  ggplot(aes(x=Year, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Year vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly positive relationship between Year and Life Expectancy. As Year increase, the life expectancy will slightly increase.

Life_Exp %>%
  ggplot(aes(x=`Adult Mortality`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Adult Mortality vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously negative relationship between Adult Mortality and Life Expectancy. As adult mortality increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x=`infant deaths`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Infant Deaths vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly negative relationship between Infant Deaths and Life Expectancy. As infant deaths increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x= Alcohol, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Alcohol vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a positive relationship between Alcohol and Life Expectancy. As Alcohol increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x=`percentage expenditure`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Percentage Expenditure vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a positive relationship between percentage expenditure and Life Expectancy. As percentage expenditure increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x=`Hepatitis B`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Hepatitis B vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly positive relationship between Hepatitis B and Life Expectancy. As Hepatitis B increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x= Measles, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Measles vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly negative relationship between Measles and Life Expectancy. As Measles increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x= BMI, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of BMI vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously positive relationship between BMI and Life Expectancy. As BMI increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x= `under-five deaths`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of under-five deaths vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly negative relationship between under-five deaths and Life Expectancy. As under-five deaths increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x= Polio, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Polio vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly positive relationship between Polio and Life Expectancy. As Polio increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x= `Total expenditure`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Total expenditure vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly positive relationship between Total expenditure and Life Expectancy. As Total expenditure increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x= Diphtheria, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Diphtheria vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a slightly positive relationship between Diphtheria and Life Expectancy. As Diphtheria increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x=`HIV/AIDS`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of HIV/AIDS vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously negative relationship between HIV/AIDS and Life Expectancy. As HIV/AIDS increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x= GDP, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of GDP vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a positive relationship between GDP and Life Expectancy. As GDP increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x= Population, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Population vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is almost no relationship between Population and Life Expectancy.

Life_Exp %>%
  ggplot(aes(x=`thinness  1-19 years`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of thinness  1-19 years vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously negative relationship between thinness 1-19 years and Life Expectancy. As thinness 1-19 years increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x=`thinness 5-9 years`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of thinness 5-9 years vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously negative relationship between thinness 5-9 years and Life Expectancy. As thinness 5-9 years increase, the life expectancy will decrease.

Life_Exp %>%
  ggplot(aes(x= `Income composition of resources`, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Income composition of resources vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously positive relationship between Income composition of resources and Life Expectancy. As Income composition of resources increase, the life expectancy will increase.

Life_Exp %>%
  ggplot(aes(x= Schooling, y=`Life expectancy`)) + 
  geom_point(alpha = 0.2) + 
  labs(title = 'scatterplot of Schooling vs Life Expectancy') +
  geom_smooth(method = 'lm', formula = 'y ~ x') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

There is a obviously positive relationship between Schooling and Life Expectancy. As Schooling increase, the life expectancy will increase.

Correlation of all num variables

First, we obtain the correlation matrix of all the continuous variables except time to take a rough look at positive or negative correlation.

library(corrr)
cor_life_exp <- Life_Exp %>%
  select(-Country, -Year, -Status) %>%
  correlate()
rplot(cor_life_exp)

Next, we create a visualization of the matrix to further analyze the correlation among the variables. (With the specific correlation number, we can determine their correlation specifically.)

ggcorr(
  Life_Exp %>% select(-Country, -Year, -Status),
  cor_matrix = cor(Life_Exp %>% select(-Country, -Year, -Status), use = "pairwise"),
    label = T, 
    hjust = 0.95,
    angle = 0,
    size = 4,
    layout.exp = 3
)

There are several things to state:

  1. For the response variable life expectancy, the most negative correlated predictors are adult mortality, Hiv-aids and thinness. The most positive correlated predictors are schooling, income composition of resources, BMI and GDP.

  2. The infant death has 1 correlation with under five deaths, so we should only include the under 5 death. Also, for the 2 thinness variables, we should only include 1 as their correlation is 0.9. Same with the percentage expenditure and GDP. We should not include all of these predictors, because otherwise colinearlity will be introduced.

Then when we exclude these variables, we can obtain the remaining variables correlation network plot.

corrr::correlate(Life_Exp %>%
   select(-Country, -Year, -Status, -`infant deaths`,
   - `thinness 5-9 years`, -`Total expenditure`)) %>% corrr::network_plot()

From this network we can observe that the 3 vaccinate variable: Polio, Diphtheria and Hepatitis B are close in correlation. GDP, percentage expenditure, schooling, Alcohol, BMI are close in correlation. Measles, under 5 deaths and thiness are close in correlation.

Summary from EDA

  • The variables that life expectancy is most positively related are: schooling, and income composition of resources, BMI. The BMI seems a little bit wierd, because we know higher BMI will indicate obesity. This may correlated with other economy indicators.

  • The variables most negatively associated with life expectancy are: adult mortality and HIV/AIDS, thinness. These are not a surprise.

  • Life expectancy does not have much correlation with population.

  • Counties in Europe, Oceania, and North America has higher life expectancy, and countries in Africa has lower life expectancy.

  • Developed countries has higher life expectancy than undeveloped countries.

  • Life Expectancy range from 44 to 89, and most concentrated on from 70 to 75.

Data splitting

Now we are going to build models to fit the data. We first apply cross validation on the dataset. Because the data includes life expectancy from different countries, so the best train-test split should not applied on the whole dataset directly, but should applied on the countries.

Here we split the data with 80% train and 20% test.

set.seed(123)
countries = unique(Life_Exp$Country)
library(caret)

train <- sample(length(countries), floor(0.8 * length(countries)))
countries_train <- countries[train]
countries_test <- countries[-train]

Life_train = Life_Exp %>% dplyr::filter(
  Country %in%  countries_train
)

Life_test = Life_Exp %>% dplyr::filter(
  Country %in%  countries_test
)

Verify that the training and testing data sets have the appropriate number of observations.

dim(Life_Exp)
## [1] 1649   22
dim(Life_train)
## [1] 1313   22
dim(Life_test)
## [1] 336  22
# the number of observations for all data
a <- nrow(Life_Exp)
# the number of observations for training data
b <- nrow(Life_train)
# the number of observations for test data
c <- nrow(Life_test)
# the percentage of observations for training data
per_train <- b/a
print(paste('the percentage of observations for training data is', per_train))
## [1] "the percentage of observations for training data is 0.796240145542753"
# the percentage of observations for test data
per_test <- c/a
print(paste('the percentage of observations for test data is', per_test))
## [1] "the percentage of observations for test data is 0.203759854457247"

The probability of training data observations is 0.7962401, which is almost equal to prob=0.80, so the training and testing data sets have the appropriate number of observations.

For cross validation, we will use the caret package to achieve cross validation in model training.

Model fitting

Linear regression with feature selection

First we deselect the variable country and year, and also deselect these variable that may cause colinearity in the previous part.

dataLin = Life_train %>% 
    select(-Country) %>%
    select(-Year,  -`infant deaths`, - `thinness 5-9 years`, -`Total expenditure`)

Now we use the backward variable selection technique in linear regression to build the best model based on the training dataset.

library(leaps)

backward_sel <- regsubsets(x=`Life expectancy` ~.,data= dataLin, nvmax=17, method="backward")

back_summary <- summary(backward_sel)

back_summary_df <- data.frame(
   cp = back_summary$cp,
   ADJ.R2 = back_summary$adjr2
)
back_summary_df
##             cp    ADJ.R2
## 1  2542.136721 0.5027101
## 2   714.082446 0.7388223
## 3   329.839548 0.7885737
## 4   161.469184 0.8104622
## 5   120.604287 0.8158691
## 6    78.361997 0.8214630
## 7    51.333224 0.8250918
## 8    33.735217 0.8275019
## 9    22.668317 0.8290672
## 10   13.207885 0.8304260
## 11    9.577393 0.8310283
## 12   10.133117 0.8310864
## 13   11.415776 0.8310499
## 14   13.049681 0.8309674
## 15   15.007324 0.8308426
## 16   17.000000 0.8307131

Then based on \(C_p\) and adjusted R square, we should choose the model with 11 variables.

The backward selection exclude variable GDP, population, Polio, Hepatitis B, thiness.

So the final regression model include the variables: StatusDeveloping, Adult Mortality, Alcohol, percentage expenditure, Measles, BMI.

The model summary is here:

dataLin1 = dataLin %>% select(-GDP,-`thinness  1-19 years`, -Population, -Polio, -`Hepatitis B`)
model1 = lm(`Life expectancy` ~ .,data= dataLin1)
summary(model1)
## 
## Call:
## lm(formula = `Life expectancy` ~ ., data = dataLin1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4546  -2.1745   0.0615   2.4256  11.2883 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        5.563e+01  9.054e-01  61.446  < 2e-16 ***
## StatusDeveloping                  -2.093e+00  4.336e-01  -4.826 1.56e-06 ***
## `Adult Mortality`                 -1.565e-02  1.066e-03 -14.673  < 2e-16 ***
## Alcohol                           -1.330e-01  3.954e-02  -3.365 0.000789 ***
## `percentage expenditure`           3.462e-04  6.540e-05   5.294 1.40e-07 ***
## Measles                            2.682e-05  1.129e-05   2.375 0.017690 *  
## BMI                                3.988e-02  6.426e-03   6.206 7.31e-10 ***
## `under-five deaths`               -6.206e-03  1.289e-03  -4.814 1.65e-06 ***
## Diphtheria                         2.344e-02  5.348e-03   4.383 1.27e-05 ***
## `HIV/AIDS`                        -4.567e-01  1.891e-02 -24.149  < 2e-16 ***
## `Income composition of resources`  1.118e+01  9.805e-01  11.404  < 2e-16 ***
## Schooling                          7.520e-01  6.941e-02  10.835  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.699 on 1301 degrees of freedom
## Multiple R-squared:  0.8324, Adjusted R-squared:  0.831 
## F-statistic: 587.6 on 11 and 1301 DF,  p-value: < 2.2e-16

And the model RMSE for the train and test are:

library(caret)
dataLin2 = Life_test %>% 
    select(-Country) %>%
    select(-Year, 
           -`infant deaths`,
   - `thinness 5-9 years`, -`Total expenditure`, -GDP,-`thinness  1-19 years`, -Population, -Polio, -`Hepatitis B`)

predictions_train <- predict(model1, dataLin1)

predictions_test <- predict(model1, dataLin1)
# computing model performance metrics
result = data.frame(name = "linear regression",
            RMSE_train = RMSE(predictions_train, dataLin1$`Life expectancy`),
            RMSE_test = RMSE(predictions_test, dataLin2$`Life expectancy`))
result
##                name RMSE_train RMSE_test
## 1 linear regression   3.682352  11.84439

Lasso regression

Now use the LASSO to build the regression model. Here we use cross validation on the training set to determine the best lambda for LASSO regression.

library(glmnet)
set.seed(123)
cv_lambda_LASSO <- cv.glmnet(
  x = as.matrix(Life_train[,-c(1,2,3,4)]), y = as.matrix(Life_train[,4]),
  alpha = 1,
  lambda = exp(seq(-8, 3, 0.2)),
  family = 'gaussian'
)
lambda_min_MSE_LASSO <- round(cv_lambda_LASSO$lambda.min, 5)
plot(cv_lambda_LASSO, main = "Lambda selection by CV with LASSO\n\n")

plot(cv_lambda_LASSO$glmnet.fit, "lambda")
abline(v = log(lambda_min_MSE_LASSO), col = "red", lwd = 3, lty = 2)

So the best lambda should be 0.00034, thus the RMSE of the train and test dataset of LASSO regression model is:

model2 = glmnet(
  x = as.matrix(Life_train[,-c(1,2,3,4)]), y = as.matrix(Life_train[,4]),
  alpha = 1,
  lambda = lambda_min_MSE_LASSO,
  family = 'gaussian'
)

predictions_train <- predict(model2, newx = as.matrix(Life_train[,-c(1,2,3,4)]),s = lambda_min_MSE_LASSO)

predictions_test <- predict(model2, newx = as.matrix(Life_test[,-c(1,2,3,4)]),s = lambda_min_MSE_LASSO)
# computing model performance metrics
result[2,] = c("LASSO", 
               RMSE(predictions_train, Life_train$`Life expectancy`),
               RMSE(predictions_test, Life_test$`Life expectancy`))
result
##                name       RMSE_train        RMSE_test
## 1 linear regression 3.68235242874498 11.8443909382826
## 2             LASSO 3.58975517321894 3.67862498303843

Ridge regression

Now use the Ridge regression to build the regression model. Here we use cross validation on the training set to determine the best lambda for Ridge regression.

cv_lambda_ridge <- cv.glmnet(
  x = as.matrix(Life_train[,-c(1,2,3,4)]), y = as.matrix(Life_train[,4]),
  alpha = 0,
  lambda = exp(seq(-11, 8, 0.2)),
  family = 'gaussian'
)
lambda_min_MSE_ridge <- round(cv_lambda_ridge$lambda.min, 5)
plot(cv_lambda_ridge, main = "Lambda selection by CV with Ridge\n\n")

plot(cv_lambda_ridge$glmnet.fit, "lambda")
abline(v = log(lambda_min_MSE_ridge), col = "red", lwd = 3, lty = 2)

So the best ridge lambda is 0.00203 based on cross validation, thus the RMSE of the train and test dataset of ridge regression model is:

model3 = glmnet(
  x = as.matrix(Life_train[,-c(1,2,3,4)]), y = as.matrix(Life_train[,4]),
  alpha = 0,
  lambda = lambda_min_MSE_ridge,
  family = 'gaussian'
)

predictions_train <- predict(model3, newx = as.matrix(Life_train[,-c(1,2,3,4)]),s = lambda_min_MSE_ridge)

predictions_test <- predict(model3, newx = as.matrix(Life_test[,-c(1,2,3,4)]),s = lambda_min_MSE_ridge)
# computing model performance metrics
result[3,] = c("Ridge", 
               RMSE(predictions_train, Life_train$`Life expectancy`),
               RMSE(predictions_test, Life_test$`Life expectancy`))
result
##                name       RMSE_train        RMSE_test
## 1 linear regression 3.68235242874498 11.8443909382826
## 2             LASSO 3.58975517321894 3.67862498303843
## 3             Ridge 3.58971228410238 3.68009079788661

Random forest

Now we use cross validation to determine the best mtry parameter for random forest here.

rf.expand <- expand.grid(mtry = 8:13)
set.seed(123)
rf <- caret::train(as.data.frame(Life_train[,c(2,3,5:22)]), Life_train$`Life expectancy`, method = "rf",
    metric = "RMSE", trControl = trainControl(method = "cv"), tuneGrid = rf.expand)
rf
## Random Forest 
## 
## 1313 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1182, 1181, 1181, 1181, 1181, 1183, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    8    1.718092  0.9637116  1.098883
##    9    1.710819  0.9640156  1.089222
##   10    1.710184  0.9638902  1.084850
##   11    1.706025  0.9639738  1.078061
##   12    1.703839  0.9639850  1.073612
##   13    1.705254  0.9638531  1.073017
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 12.
library(randomForest)
varImpPlot(rf$finalModel, type = 2, main = "Random Forest")

And the RMSE of random forest model is:

predictions_train <- predict(rf$finalModel, Life_train)

predictions_test <- predict(rf$finalModel, Life_test)
# computing model performance metrics
result[4,] = c("Random forest", 
               RMSE(predictions_train, Life_train$`Life expectancy`),
               RMSE(predictions_test, Life_test$`Life expectancy`))
result
##                name        RMSE_train        RMSE_test
## 1 linear regression  3.68235242874498 11.8443909382826
## 2             LASSO  3.58975517321894 3.67862498303843
## 3             Ridge  3.58971228410238 3.68009079788661
## 4     Random forest 0.691508912385754 2.68853765375303

So the best fitting model is the random forest, with the mtry = 12. The best model has the smallest RMSE on train 0.692 and test 2.689, better than the others.

Conclusion

From out exploratory data analysis and model fitting part, we can know that the random forest method performs much better than the regression methods. Of all these methods, the linear regression perform the worst. Also from the importance variable plot from random forest model, we can observe that the variable Aids/Hiv, income from resources, adult motality and schooling have the highest importance to life expectancy. We also checked that developing countries have on average lower life expectancy than developed countries.

The next steps can be the longitudinal data analysis. Because this data has a covariate time, and this data is a longitudinal data, so some complex model such as linear mixed model (random effects model).